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Abstract 

Background: The accessibility of high-throughput genotyping technologies has contributed greatly to the development 
of genomic resources in non-model organisms. High-density genotyping arrays have only recently been developed for 
some economically important species such as conifers. The potential for using genomic technologies in association 
mapping and breeding depends largely on the genome wide patterns of diversity and linkage disequilibrium in 
current breeding populations. This study aims to deepen our knowledge regarding these issues in maritime 
pine, the first species used for reforestation in south western Europe. 

Results: Using a new map merging algorithm, we first established a 1,712 cM composite linkage map 
(comprising 1,838 SNP markers in 12 linkage groups) by bringing together three already available genetic maps. 
Using rigorous statistical testing based on kernel density estimation and resampling we identified cold and hot 
spots of recombination. In parallel, 186 unrelated trees of a mass-selected population were genotyped using a 
12k-SNP array. A total of 2,600 informative SNPs allowed to describe historical recombination, genetic diversity 
and genetic structure of this recently domesticated breeding pool that forms the basis of much of the current 
and future breeding of this species. We observe very low levels of population genetic structure and find no evidence 
that artificial selection has caused a reduction in genetic diversity. By combining these two pieces of information, we 
provided the map position of 1,671 SNPs corresponding to 1,192 different loci. This made it possible to analyze 
the spatial pattern of genetic diversity {He) and long distance linkage disequilibrium (LD) along the chromosomes. 
We found no particular pattern in the empirical variogram of Hq across the 12 linkage groups and, as expected for an 
outcrossing species with large effective population size, we observed an almost complete lack of long distance LD. 

Conclusions: These results are a stepping stone for the development of strategies for studies in population genomics, 
association mapping and genomic prediction in this economical and ecologically important forest tree species. 
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Background 

Conifers represent an ancient and widespread lineage of 
about 650 species [1,2]. They are of immense ecological 
and economic importance as they dominate many ter- 
restrial landscapes and are primarily used for timber and 
paper production worldwide. Domestication of some of 
these species started in the mid 1950'^^ with mass selection 
of outstanding genotypes in natural forests [3]. Genetic im- 
provement programs resulted in advances in biomass pro- 
duction, wood quality and resistance to biotic and abiotic 
stresses. However, traditional breeding has remained a slow 
process because of long generation intervals and because 
most traits cannot be correctly evaluated at an early devel- 
opmental stage. The application of genomic techniques in 
crop [4] and animal [5,6] breeding has resulted in 
more powerful methods for genetic evaluation, and re- 
cent advances in conifer genomics [6-8] have allowed 
tree breeders to use these tools and methodologies 
(namely association mapping and genomic prediction) 
to dissect the genetic basis of phenotypic variability 
and to accelerate the breeding process of these long- 
lived organisms [9]. 

Knowledge about linkage disequilibrium (LD) measured 
by the squared correlation between two loci is important 
for applications of molecular markers in association map- 
ping and genomic prediction. The decay of LD over phys- 
ical and genetic distance determines the resolution and 
density of the markers required for association mapping 
[10,11]. A formal link between the power of association 
tests and LD was established [12], and has recently been 
generalized for structured populations with related geno- 
types [13]. LD also determines the accuracy of genomic 
estimated breeding values [14,15]. Indeed, the direct and 
inverse relationship between expected LD (r^) and popula- 
tion recombination rate (r^ = l/(4NeC +1)) has obvious 
consequences for genomic prediction, because both the 
training population size and marker density vary with Ne, 
the effective population size [16,17]. 

Previous studies of short-distance (physical) LD in 
conifers, including maritime pine [18,19], have shown 
that LD extends to only a few hundred to a few thou- 
sand base pairs (reviewed in [20]), but with consider- 
able variation between genes [21]. These results have 
led to the conclusion that millions of SNPs would be 
required for very high resolution of whole-genome 
scan association mapping approaches for forest trees,. 
Thus candidate gene-based approaches have been fa- 
vored and may prove the best option before sufficiently 
larger numbers of markers, covering the whole gen- 
ome, become available [22] as recently illustrated for 
fruit and forest trees [23], including maritime pine 
[24]. Considering about 32 thousand genes, with an 
average gene size of 3-3.5 kb, Pavy et al. [25] estimated 
that a total of 1.1-1.3 million SNPs would be required 



to cover the gene space of spruce at a rate of one SNP 
per 85 bp, which may in any case correspond to only a 
tiny fraction of the megagenome of this species. Only a 
few studies have examined the extent and genome- 
wide distribution of LD in conifers. Using physical in- 
formation from three random BAC clones, Moritsuka 
et al. [26] reported significant LD (surprisingly, extend- 
ing over a distance of 100 kb) in non coding regions of 
the Cryptomeria japonica genome, suggesting that re- 
combination rate may vary according to the nature (coding 
vs. non coding, low copy vs. repeated sequences) of DNA, 
as shown in angiosperms [27] and gymnosperms [28]. In 
the same species, Tsumura et al. [29] discovered that some 
loci showing divergence along environmental gradients and 
located in different linkage groups, displayed substantial LD, 
suggesting an effect of epistatic selection between these loci. 
To our knowledge, only one study in Pinus taeda [30] 
reported LD for 807 mapped SNPs and confirmed the 
assumption of independence between genetically linked 
loci. This study showed that only a handful of loci de- 
parted from this expectation, five of which were coseg- 
regating loci displaying a high degree of differentiation 
between populations. This pattern was attributed to the 
presence of a 'genomic island' of differentiation. 

The main objective of this paper was to describe LD 
pattern, level and structure of genetic diversity across 
the maritime pine genome. The result may provide base- 
line information for future genetic studies (association 
mapping, genomic selection) in this economically im- 
portant conifer. To this end, we first establish a high- 
density genetic linkage map by merging three existing 
SNP-based maps [31] using map merging approaches 
implemented in the software LPmerge [32,33] and Mer- 
geMap [34]. Then, a set of unrelated individuals in the 
first stage of domestication was genotyped with the 
mapped markers to describe the genome-wide history of 
recombination and estimate the level and structure of 
genetic diversity in this first generation breeding popula- 
tion. Based on knowledge on other forest tree species, 
we would expect high levels of genetic diversity, a lack 
of extended LD and limited population structure [22], 
whereas the applied mass selection might be expected to 
have decreased diversity around the loci underlying the 
selected target traits [35]. All of these effects would have 
important implications for association mapping [36] and 
genomic prediction in breeding [37]. 

Results 

Construction of a composite linkage map for maritime 
pine and distribution of recombination on chromosomes 

We used the following strategy to integrate the three 
linkage maps, G2F, G2M and F2, into a single composite 
map. First, intermediate composite maps were estab- 
lished for G2F-F2 and G2M-F2 because there were few 
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markers common to the G2F and G2M maps suitable 
for anchoring (25 in total), whereas 198 SNPs were com- 
mon to F2 and G2F maps and 240 SNPs were common 
to F2 and G2M maps (see [31]). We then calculated a 
final composite map from these two intermediate maps. 
It comprised 1,838 SNPs (1 SNP/contig) distributed 
along 12 LGs (corresponding to the haploid chromo- 
some number), with a minimum of 121 markers in LG8 
and a maximum of 186 markers in LG3. With LPmerge 
software, the 12 composite LGs covered a distance of 
1,712 cM, with individual LG lengths ranging from 115 
(LG12) to 182 cM (LGS), and a density of 1 SNP marker 
per 0.9 cM (Figure 1; Additional file 1). With MergeMap 
software, the LGs covered 1,850.5 cM, with a individual 
LG length ranging from 119 (LG12) to 182 cM (LG2) 
and a density of 1 SNP per cM. 

We compared the results generated by LPmerge and 
MergeMap methods, by carrying out Wilcoxon signed 
rank tests on two metrics: the linkage group length of 
the composite map, and the root mean square error 
(RMSE) calculated from the difference in map position 
(in cM), between each component map and the resulting 
composite map. Three hypotheses were tested: i) the 
map lengths obtained for the intermediate (or final) com- 
posite maps do not differ significantly between LPmerge 
and MergeMap; ii) The difference in RMSE between 
component (or intermediate composite) maps and the 
resulting intermediate (or final) composite map does 
not differ significantly between LPmerge and MergeMap, 
and iii) the RMSE for each component (or intermediate 
composite) map does not differ significantly from the 
intermediate (or final) composite map constructed with 
LPmerge, and similarly for MergeMap. MergeMap sys- 
tematically yielded longer maps than LPmerge, for both 
intermediate and final composite maps (Additional file 2). 
RMSEs were determined for each linkage group after the 
map merging process. Comparisons between the two pro- 
grams showed that MergeMap gave larger RMSEs than 
LPmerge (optimized for the K parameter) for intermediate 
composite maps, but that RMSEs were similar for the two 
programs after the final step of map merging (Additional file 2). 
Despite these differences, marker order was highly 
correlated (Spearman's rank R > 0.87, P < 0.0001), for all 
LGs, between the composite maps constructed with 
LPmerge and MergeMap. Finally, correlations between 
marker positions on parental maps (F2, G2F and G2M) 
and on the final composite map constructed with 
LPmerge were high (Spearmans rank R > 0.95, P < 
0.0001), indicating that the positions of the markers on 
the composite map were consistent with those on the 
corresponding source maps. 

A chi^-test (df=ll) was performed on the composite 
map, to determine whether genes were evenly distrib- 
uted between maritime pine chromosomes. With twice 
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Figure 1 Alignment of tlie composite linlcage maps (illustrated 
by LG1) obtained with LPmerge (LPM on the left) and 
MergeMap (MM on the right) software. The whole map is 
available in Additional File 1. 
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as many markers than in our first investigation [31] it 
was clear that the number of markers per LG did not de- 
viate significantly from a uniform distribution over the 
12 linkage groups {P = 0.65). In terms of the distribution of 
markers on individual chromosomes, we found that a 
density of at least 10 markers per bandwidth (P = 3.3 x 10"^^) 
was required for the identification of a recombination 
cold spot, whereas a density of at most three markers 
(P = 3.2 X 10"^^) characterized a hot spot for recombin- 
ation. Given these upper and lower limits, and consid- 
ering the stringent confidence interval defined for 
kernel density function, we identified 13 significant 
clusters of markers (in 8 LGs), corresponding to re- 
combination cold spots (Figure 2). It proved more dif- 
ficult to identify significant hot spots of recombination 
(we found only two). As reported in [31], hot spots are 
more genetically variable, and it is therefore more diffi- 
cult to detect them on a composite map maximizing 
the number of recombination events from individual 
crosses. Examination of the shape of the kernel density 
estimate revealed that seven linkage groups (LGl, 3, 5, 
6, 8, 11, 12) had three clear peaks, with locations con- 
sistent with the centromeric and telomeric regions. 
Compared to the study by Chancerel et al. [31] more 
rigorous statistical testing (using resampling to define 
confidence interval) certainly contributed to discard a 
number of false positives. However, one should not 
forget that the distribution of recombination is genet- 
ically variable, therefore by merging information from 
different genetic maps it is likely that only stable hot 
and cold spots across the studied genetic backgrounds 
were revealed. 

SNP-assay genotyping statistics for the first-generation 
breeding (FGB) population 

The mean call rate (percentage of valid genotype calls) 
was 92% for the FGB population. Two poorly perform- 
ing samples were identified by plotting the sample call 
rate against the 10% GeneCall score. Three pairs of trees 
were found to display identical genotypic information 
for the 2,600 SNPs and were therefore considered mis- 
labeled in the tree archive (Additional file 3a). All six 
trees were discarded. This left 186 trees for the analysis 
of population genetics parameters. In total, 2,600 SNPs 
were polymorphic (2,532 SNPs and 68 indels), corre- 
sponding to 1,706 contigs of the maritime pine unigene 
(PineContig_v2, [31]). We positioned 1,671 of these SNPs, 
corresponding to 1,192 different loci, on the composite 
map. The overall conversion rate (number of polymorphic 
SNPs or indels divided by the total number of SNPs or 
indels in the assay, i.e. 9,279 SNPs) was therefore 28%. 
In total, 2,605 of the 3,498 "failed" assays corresponded 
to SNPs and 893 to indels, whereas 1,162 of the 3,181 
monomorphic loci corresponded to SNPs and 2,019 



corresponded to indels. This increased the conversion 
rate to 40.2% for SNPs and decreased the rate for 
indels to 2.3%, indicating that indels should be avoided 
when designing an Infinium assay. A list of poly- 
morphic SNPs is available from the NCBI dbSNP data- 
base (http://www.ncbi.nlm.nih.gov/SNP) and is also 
provided in Additional file 4. 

Test for Hardy-Weinberg equilibrium, distribution of 
minor allele frequency and population structure analysis 

Significant departure from Hardy-Weinberg equilibrium 
was detected for 12 SNPs from the 2,600 polymorphic 
markers in the FGB population (5% type I nominal 
error). After Bonferroni correction for multiple tests 
(5%/2,474 independent tests, although they were not 
all independent, i.e. an experiment-wise type I error of 
0.002%) none of these SNPs yielded a value significantly 
different from the expected value. We can therefore con- 
sider that the percentages of each of the three SNP ge- 
notypes remained constant in what can be considered a 
large population, with random mating, without muta- 
tion, migration or natural selection. The minor allele fre- 
quency (MAF) distribution of these 2,600 SNPs is shown 
in Figure 3. A total of 106 SNPs presented a MAF<5%. 
The scatter plots of these rare SNP alleles were checked 
visually, one-by-one, with GenomeStudio genotyping 
software. In all cases, the clustering profile was con- 
firmed. This distribution is unlikely to reflect the true 
MAF distribution for SNPs in the studied population. 
Indeed, as pointed out in [31], in silico SNP detection 
based on the use of sequenced cDNA libraries introduces 
an ascertainment bias toward genes that are strongly 
expressed (as they are called from expressed sequence 
tags) and, probably, less polymorphic, due to the stringent 
cutoffs used: i) MAF>33% and coverage>10x, to prevent 
the selection of SNPs present at such low frequencies that 
they are likely to be the product of sequencing error, ii) 
ADT score > 0.75, to minimize the variability of the flank- 
ing region surrounding the targeted SNP, thereby increas- 
ing the likelihood of a successful Illumina Infinium assay. 
In addition, the MAF spectrum is likely to be shifted up- 
ward, with an underrepresentation of rare alleles not cap- 
tured due to the small size of the sample used to prepare 
the cDNA libraries. The possibility of such a bias should 
be borne in mind when making fijirther evolutionary infer- 
ences concerning the demographic and selective history of 
maritime pine populations. 

Population structure and relatedness between individ- 
uals are known to bias the estimation of LD [13]. In this 
study, the trees of the FGB population were selected 
from natural stands in the Landes forest, with a sam- 
pling method designed to ensure the sampling of unre- 
lated individuals. The observed patterns of pairwise 
relatedness (Additional file 3b) suggests that this objective 
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Figure 2 Kernel density estimators (left y-axis) of marlcer density (right y-axis) along each linkage group (x-axis in cM). The red curve 
corresponds to the kernel density estimator. The surrounding bandplot (in darl< blue) is the confidence interval of the kernel density estimator. 
The horizontal bandplot (in light blue) is the range of variation of marker density under a Poisson distribution. When the lower or upper limit of 
the confidence interval is above or below this range, we declare the presence of a significant cold or hot spot of recombination, respectively. 
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Figure 3 Minimum allele frequency (MAF, x-axis) distribution of each SNP in the first-generation breeding (FGB) population. 



was achieved. We tested for possible cryptic relatedness or 
differentiation, by performing principal component ana- 
lysis (PCA) on the full genotype matrix of 2,600 SNPs. A 
comparison of the size of the eigenvalues obtained with 
the Tracy- Widom distribution yielded two significant prin- 
cipal components. In theory, this could indicate the pres- 
ence of three distinct subpopulations, clustering on the 
basis of the first two PCs yielded three groups with very 
low levels of genetic differentiation (Fst 0.002-0.005). We 
plotted these individuals along the two significant PCs and 
found little evidence of separate clusters (Additional file 5a). 
Geographic analysis reveals a significant relationship 
(r^ = 0,17, p = 0.007) between genetic PCI and the major 
axis of geographic variation (mostly latitude), with some 
evidence of PC2 being associated with the second axis 
(mostly longitude) (r^ = 0.06, p = 0.11) (Additional file 5b,c,d). 
Overall, there was a weak, but significant pattern of iso- 
lation by distance (r = 0.2, p = 0.006) (Additional file 5e) 
rather than a division into distinct groups. This result 
was confirmed by the structure analysis performed with 
Structure software (Additional file 6). In this analysis, the 
values of mean likelihood obtained for the one- to ten- 
group models tested did not reach a plateau and Evanno s 
delta K criterion did not identify a peak for any of the 
K values tested. Moreover, for K values ranging from 2 
to 10, the entire set of 186 individuals was found to be 
admixed, with none being identified as a full member 
of a specific group. These patterns are typical of an un- 
structured population [38] and indicate the absence of 
a particular genetic structure at the scale of the FGB 
population. 

Spatial analysis of genetic diversity on chromosomes 

The mean value of Nei s diversity index (He) calculated 
for the 2,600 SNPs was 0.391 (SD = 0.127), while that for 



the 1,421 SNPs corresponding to mapped contigs was 
0.434, (SD = 0.067). These are very high estimates given 
the biallelic nature of these markers (the maximum Hg 
being 0.5 for a biallelic marker). We used the mapped 
markers to determine whether genetic diversity was 
equally distributed between the LGs (i.e. presence of 
LGs with lower or higher overall diversity. Additional file 7). 
A significant difference (P<0.05) between Hg values was 
observed. Tukey s HSD test showed that LGs could be 
classified into three groups, with lower (LG3-6, Hg = 0.419; 
SD = 0.072), medium (LGl-2-7-8-9-10-11-12, Hg = 0.434; 
SD = 0.066) and higher (LG 4-5, Hg = 0.449; SD = 0.059) 
levels of diversity. 

We then used a spatial statistics approach to determine 
whether the genetic diversity of the mapped markers was 
distributed non-uniformly along the chromosomes. We 
estimated the empirical variogram ofH^ (fp), to determine 
whether neighboring genes on the chromosome presented 
similar patterns of diversity. A spatially structured process 
would show an increase in variance with increasing map 
distance between markers. Based on all the gene loci 
from the composite map and map distances ranging 
from 0 to 10 cM, we found no particular relationship 
between fj^ and gene position on the composite map. 
Most of the calculated empirical variances fell within 
the area predicted by permutation (Figure 4). This was 
true for the individual LGs of the composite map and 
was confirmed for the component maps as well (data 
not shown). Thus, diversity at neighboring gene loci 
was not correlated with recombination distances in the 
study population and, with the marker density used, 
there is little evidence for extended reductions in di- 
versity due to selective sweeps. Given this result, we 
did not attempt to krige our data to detect hot or cold 
spots of diversity at a centimorgan scale. 
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Extent of intra- and inter-chromosomal linkage 
disequilibrium 

At least two SNPs were available in 248 EST-contigs for 
investigation of the pattern of physical LD. We consid- 
ered SNPs with a MAF>5%, resulting in the retention of 
714 pairs for the analysis. However, given the biased pro- 
cedure used to select SNPs in silico, the biased represen- 
tation of polymorphic sites within these contigs and the 
skewed distribution of distances between sites (half the 
pairs being at a distance < 250 bp), the observed pattern 
of short-distance LD (not shown) was not consistent 
with trends typically observed in conifers [39] based on 
amplicon sequencing. In addition, the estimate of the 
population experimental parameter (C) was negative, 
precluding any use of this data set for the further inter- 
pretation of physical LD over short distances. 

The pattern of long-distance (genetic) LD was exam- 
ined for the first time in this species, over the 12 chro- 
mosomes, on the basis of SNP markers localized on the 
composite linkage map and their genotypic profiles in an 
unstructured population. The distribution of the squared 
correlation coefficient for allelic frequencies (r^) showed 
that LD decreased rapidly over very short genetic dis- 
tances for all chromosomes (Figure 5; Additional file 8). 
However, we also identified 380 pairs (0.45% of the 
84,679 pair-wise combinations) for which the r^ was 
above the 0.1 critical level, while the genetic distance 
was different from 0 in the composite map. In order to 



verify whether these possible long distance LD (possibly 
due to epistatic selection) were not due to inaccurate 
map position resulting from the construction of the 
composite linkage map, we directly checked the map 
position of these pairs in the components maps. From 
these 380 pairs, 238 originated from the same compo- 
nent map, while 142 were from different component 
maps. From these 238 pairs, the genetic distance in the 
component map was equal to 0 cM for 102 pairs and 
comprised between 0 and 1 cM for 66 pairs, indicating 
that their position in the r^ plot was probably unreliable 
and therefore could not be used to infer long distance 
LD. An extreme case is provided for two outliers 
markers (F51TW9001DHGV3 and CT583376) in LG3 
placed 23 cM apart in the composite map, while they 
completely co-segregated in the component map (G2M). 
Thus, only 70 pairs (i.e. 238-102-66) were left to con- 
struct the distribution of long distance (i.e. non physical) 
LD. As rare allele frequency can influence LD, this dis- 
tribution (Additional file 9) was drawn based on 65 pairs 
(listed in Additional file 10) from which both markers 
had a MAF >20%. In cases where a ftmctional annota- 
tion was available, there was no similarity between a 
marker pair suggesting that these SNPs belonged to 
different genes rather than to different contigs of the 
same gene. In addition, 34 cases (highlighted in bold in 
Additional file 10) of such possible long distance LD could 
be confirmed by the fact that intragenic SNPs presented 
similar r^ values with SNPs in another gene. Finally, this 
distribution was used as a null model to test the signifi- 
cance of inter-chromosomal LD (potentially due epistatic 
selection). Each inter-chromosomal LD value was tested 
against the upper bound of this null distribution (signifi- 
cant if r^ > 0.32 at the 5% level). Given the number of tests 
performed, Bonferroni correction was applied to this 
upper bound (see the blue area in Figure 6). No significant 
inter-chromosomal LD was found in this population. 

Discussion 

Development of a composite map for maritime pine and 
genome-wide distribution of recombination 

Advances in next-generation sequencing and array-based 
genotyping technologies have lowered development times 
and costs for reliable single -nucleotide polymorphism 
(SNP) markers [40,41]. The availability of such markers 
has been a boon for the generation of high-density linkage 
maps in model and non model plant species, as recently 
demonstrated in sunflower [42], barley [43], tomato [44], 
and maize [45] . The integration of information from mul- 
tiple linkage maps for hundreds to thousands of markers 
is another chaUenge. One approach to the integration of 
information for multiple populations is to pool the 
genotypic data and minimize the sum of recombin- 
ation frequencies (or related metrics), as in the maximum 
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likelihood method [46] appUed to single populations, 
e.g. [47]. However, the computational time required 
for this approach may be prohibitive in some situa- 
tions [48,49] and this method is unworkable when 
genotypic data are unavailable. An alternative strategy 
involves integrating the linkage maps for separate pop- 
ulations without analyzing their genotypic data. Yap 
et al [50] were the first to model a map as a directed 
graph, with nodes representing mapped markers and 
edges defining the order of adjacent markers. They 
also designed an algorithm for merging maps from dif- 
ferent studies on the basis of loci common to different 
maps. Wu et al, [34] subsequently developed an algo- 
rithm based on graph theory implemented in Merge- 
Map, a program that has been used to construct 
several composite maps for barley [43,51]. Endelman 
[32] discovered that the graph linearization technique 
used by MergeMap was suboptimal and proposed a 
new approach to overcome this problem through 



linear programming. However, the software developed 
by Endelman [32], DAGGER, was unable to merge 
linkage maps with ordering conflicts. LPmerge, used 
for the first time on empirical data in the present 
paper, was designed to resolve ordering conflicts between 
component linkage maps and minimize errors between the 
composite map and the component maps [33]. By using this 
software we generated a composite map consisting of 1,838 
SNP markers distributed over 12 LGs, covering 1,712 cM. 
Map length was similar to that obtained for maps con- 
structed with similar numbers of loci in other conifer spe- 
cies: 2,083 cM in Picea glauca with 1,801 loci [25], l,898cM 
with 1,816 loci in Pinus taeda [30]. We then used this map 
to investigate the genome-wide distribution of recombin- 
ation. We found clear peaks for the number of markers. 
Their locations was consistent with centromeric and telo- 
meric regions, in agreement with previous findings in other 
species with a similar genome size such wheat, reporting 
that recombination was limited in these regions [52,53]. 
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Level and genome-wide distribution of genetic diversity 
in the first breeding population of maritime pine 

We presented a genome-wide map of genetic diversity (as 
estimated from expected heterozygosity, Hg) for a popula- 
tion resulting from mass selection in natural forests, with 
an estimated selection intensity of about 1.5 x 10"^ [54]. 
This population provided us a unique opportunity to 
study the effect of the first stage of domestication on the 
level and distribution of genetic diversity in a highly het- 
erozygous forest tree species. We showed that a selection 
intensity of this magnitude did not decrease the overall 
level of genetic diversity. 

Our findings are consistent with those of previous 
studies carried out with an handful of allozyme markers 
in breeding populations of Douglas fir [55] and Sitka 
spruce [56], and with a recent investigation based on 
SNP markers spanning the entire genetic map of white 
spruce [57], showing no decrease in genetic variation 
during the first stage of domestication of these highly 
polymorphic species. We can therefore conclude that 
mass selection applied at a regional scale (the Landes 
forest covers about 1 million ha in the southwestern 



France), even with very high intensity, did not appear to 
compromise the background neutral genetic diversity of 
the maritime pine base breeding population. Thus, the 
high level of genetic diversity found in the FGB popula- 
tion is consistent with a large randomly mating popula- 
tion, as typically found for outcrossing species. 

We found no significant spatial pattern of genetic di- 
versity in the maritime pine genome (at least at the cM 
scale). Such patterns would have been indicative of de- 
creases in diversity associated with loci underlying the 
variation of the target traits. However, given the rapid 
decay of LD in this species (within a few hundred bp on 
average), the marker density used was probably too low 
to capture any localized decline in heterozygosity, if any 
occurred around selected loci. 

These results contrast with the large reduction of gen- 
etic variability observed for the selected traits [58] be- 
tween the Landes natural forest and the base population 
of the breeding program (which includes the FGB popu- 
lation), particularly for growth. We can therefore con- 
clude that these markers are probably not functionally 
important with respect to these selection criteria, in 
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agreement with the lack of statistical association be- 
tween allelic variation and breeding values for height 
growth and stem straightness (data not shown). Further 
investigations will be required to identify SNPs in LD 
with target trait-QTLs. Such investigations could involve 
the genotyping of unselected trees from wild populations 
and the comparison of allele frequencies before and after 
mass selection, or tests of association between breeding 
values and marker genotypes, as illustrated in [57] for 
white spruce. Given the polygenic basis of complex traits 
subjected to breeding, such as height and radial growth 
[59], we anticipate that this second approach is likely to 
be successful only for well chosen candidate genes puta- 
tively involved in trait variation. 

The set of 2,600 SNP markers developed in this study 
will be used to assess genetic diversity in subsequent 
generations of the maritime pine breeding program. The 
maintenance of genetic diversity is not only essential to 
guarantee the adaptation of future improved varieties to 
ongoing climatic change [60], it is also of particular im- 
portance for plant breeding programs based on recur- 
rent selection, because the progress of selection is 
determined by the level of genetic variation within the 
population. 

Long distance LD pattern and consequences for association 
mapping and genomic prediction In maritime pine 

We scored 2,600 SNPs in a population of 186 unrelated 
trees selected on the basis of their performance in nat- 
ural forests of the Aquitaine region in southwestern 
France, for establishment of the first generation of the 
maritime pine breeding program. Markers for which 
intra-chromosomal LD was estimated covered the whole 
linkage map of this species, at a mean density of 1 
marker per 1.4 cM (1 cM ~ 12 Mb in maritime pine, 
[61]). Sampled genes were well distributed across the 12 
LGs of the composite map, with 78-115 genes per LG. 
As expected, high values of r^ were obtained only for 
physically linked polymorphisms, i.e. SNPs belonging to 
the same gene. No significant LD was found over larger 
distances. These results are consistent with population 
genetics theory for such an undomesticated, outcrossing 
species, and can be attributed principally to the large 
effective sizes of the unstructured populations found in 
most conifers (estimates of effective population sizes 
for maritime pine are presented in Additional file 11). 
Similarly, no significant epistatic LD was found be- 
tween unlinked loci localized on different chromo- 
somes. LD is a property of a given gene pool, but the 
convergence of our results with those of Eckert et al, 
[62] for Pinus taeda suggests a lack of LD between gen- 
etically spaced gene-based markers in conifer species 
characterized by the same type of reproductive regime 
and Ufe history traits. 



Our findings suggest that the initial mass selection 
used to form the base population of the maritime pine 
breeding program was not only successful in terms of 
the initiation of a program to develop improved varieties 
[58], but also efficient for the sampling of neutral genetic 
diversity from the Landes forest. Absence of inbreeding 
and cryptic population structure within the base popula- 
tion were also confirmed. The substantial level of poly- 
morphism detected in the FGB population renders our 
set of markers as a valuable tool for breeding applica- 
tions. Trees have long generation interval and breeding 
is therefore a slow process. The 2,600 SNPs developed in 
this study will be extended to test the utility of genomic 
selection (GS) approaches to reduce the breeding cycles 
of maritime pine, as suggested for Pinus taeda [63,64]. 
Then, favorable combinations of polymorphisms will be 
sought in manageable breeding populations with small 
effective sizes to trace QTLs by linked markers. The 
prospective of developing GS holds great promise to 
increase the genetic gain in traits of interest in these 
long-lived organisms and to accelerate their domesti- 
cation [65,66], while maintaining sufficiently high levels of 
genetic diversity to allow the selected trees to cope with 
major bio tic and abiotic disturbances. 

Given the lack of LD in this population and lack of as- 
sociations between markers and phenotypes, predictions 
based on SNP markers for selection would likely have 
very low reliability. In several simulation studies on do- 
mestic animal and trees, LD showed a significant effect 
on reliability of predictions from genomic prediction 
models [15,16]. For example, in cattle breeding, for gen- 
omic selection to be successful the level of LD was sug- 
gested to be greater than 0.2 [14]. When LD among the 
markers increased from 0.1 to 0.2, the reliability of gen- 
omic predictions increased by 0.14 (from 0.68 to 0.82) 
[67]. LD is population specific and is expected to 
change with recombination, genetic background of the 
population and effective population size. To exploit 
marker-tagged QTL-trait associations in GS, we are 
currently combining three -generation pedigrees of mari- 
time pine (FGB and successive Gl and G2 populations), 
where LD should be much higher compared to the base 
population. 

Conclusions 

We established a 1,712 cM linkage map of maritime pine 
with 1,838 SNP markers using for the first time a new 
map merging algorithm that integrates linkage maps 
from separate populations without any recourse to ori- 
ginal genotypic data. We found clear cold spots of recom- 
bination consistent with the centromeric and telomeric 
regions of metacentric chromosomes [68]. We then used 
an extended set of 2,600 SNP markers to describe histor- 
ical recombination, genetic diversity and genetic structure 
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within a mass-selected population of 186 unrelated ge- 
notypes. The genetic structure of this population was 
very weak and we found no evidence that artificial 
selection had decreased neutral genetic diversity. Con- 
sidering the map position of 1,671 of these 2,600 
markers (corresponding to 1,192 different loci) we 
found that LD mostly extend over short physical dis- 
tances as expected for an outcrossing species with 
large effective population size. 

At the dawn of a new paradigm in forest tree breeding 
[69-71], namely the implementation of GS [37], a range 
of factors that influences the accuracy of genomic esti- 
mated breeding values needs to be carefully considered, 
including the heritability of the traits, its genetic archi- 
tecture, the extent of genotype by environment inter- 
action, the genetic structure and the effective size of the 
targeted population, the number of records in the refer- 
ence population, the number of markers and their asso- 
ciated cost, and the overall prediction and validation 
strategy. The present study provides novel results that 
should be taken into account for the implementation of 
GS in maritime pine. The drop in the status number (as 
defined by [72]) from several hundred in the mass- 
selected population, to 94 in the second breeding popu- 
lation and 23 in the elite population of the new sub-line 
structure of the breeding population (A. Raffin personal 
communication) is a favorable situation for its further 
development in this species. 

Methods 

Genetic material, DNA extraction and genotyping assay 

The two mapping populations (G2 and F2) for which 
SNP-based linkage maps were merged in this study were 
described in [73]: G2 designates a three-generation out- 
bred pedigree (full-sib progeny), whereas F2 is a three- 
generation inbred pedigree. Chancerel et al [31] 
constructed male and female linkage maps from the G2 
population (G2M and G2F, respectively), and a single 
linkage map for the F2 population (Additional file 12). 
In addition, 194 trees from the base population of the 
maritime pine breeding program, referred to here as the 
"first-generation breeding" or "FGB" population, were 
used for genetic diversity and LD analysis. During the 
1960s, adult stands in the Landes forest (south western 
France) were explored and trees considered outstanding 
in terms of their stem volume and straightness were identi- 
fied. These trees were sampled across a wide range of 
different locations covering the Aquitaine region 
(Additional file 13), particularly along the Atlantic coast, 
and were at least 50 m apart when present at the same 
site. A phenotypic index was built from the performances 
of the candidate trees and their 20 closest neighbors [54], 
to select the base population. These trees were grafted 
and stored in clonal archives [58]. 



Young needles from each tree were harvested and 
stored at -80°C until DNA extraction and genotyping 
(Infinium assay, lUumina), as described in [31]. In total, 
9,279 SNPs (6,307 SNPs sensus stricto and 2,972 indels 
distributed in 4,613 different contigs) were individually 
inspected with Genome Studio software, using a GenCall 
score cutoff of 0.15 (according to Illuminas recommenda- 
tions) to detect failed SNPs. Loci for which two or three 
clusters (depending on the type of marker segregation) 
were identified without ambiguity were considered to be 
polymorphic markers. SNP clusters were modified manu- 
ally, to refine cluster positions, when necessary. SNPs and 
surrounding sequences were submitted to dbSNP (acces- 
sion numbers are listed in Additional file 4). Overall, 186 
out of the initial set of 194 trees presented genotyping in- 
formation for 2,600 SNPs (Additional file 14). 

Linkage map development 

We compared two different software packages to gener- 
ate a composite map from three existing SNP-based 
linkage maps (G2F, G2M, F2, [31]) of maritime pine: 
LPmerge [32,33], which is available as an R package [74] 
at http://cran.r-project.org/web/packages/LPmerge/, and 
MergeMap (http://www.mergemap.org/), which has been 
used in several barley mapping projects [34,51]. To com- 
pare both algorithms, the root-mean-squared error (RMSE) 
for each marker was calculated by comparing its position 
in the composite map with that on the individual linkage 
maps, and the average RMSE across the markers within a 
linkage group was used to assess the goodness-of-fit for the 
composite map. For LPmerge, the maximum interval par- 
ameter K was varied from 1 to 8, and the composite map 
with the lowest RMSE was selected. For both software 
packages, as few markers were common to G2F and G2M, 
we first generated two intermediate composite maps 
("F2+G2F" and "F2+G2M"). We then merged intermedi- 
ate maps into a final composite map. The merging of 
the three maps in a single step yielded the same marker 
order in the composite map (Spearman's rank R > 0.99, 
p = 2.2.10"^^ data not shown), but we present the two- 
step procedure here because this approach made it possible 
to compare LPmerge and MergeMap on three datasets 
("F2+G2F", "F2+G2M" and the combination of the two), 
making it possible to draw more general conclusions. 

Analysis of marker distribution on chromosomes 

We investigated whether the mapped genes were evenly 
distributed between linkage groups (LG), by comparing 
the observed and expected numbers of genes per linkage 
group in chi^ tests (P<0.05). The expected number of 
genes for each LG was obtained by multiplying the ratio 
"size of LG /total genome length" by the total number of 
mapped markers. We also analyzed the distribution of 
markers along the chromosomes, by using a kernel 
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density estimation to calculate optimal window size 
(bandwidth) for dividing the genome into blocks, in 
which we counted the number of genes. Kernel density 
estimation is a non-parametric technique for density 
estimation, in which a known density function (here a 
Gaussian function) is averaged across the observed data 
points to create a smooth approximation. The smooth- 
ness of the density approximation depends on the band- 
width. In our case, we used a fixed and robust bandwidth 
estimator [75], based on the algorithm of Jones et al [76]. 
Bandwidth values were calculated for each linkage 
group of the composite map obtained with LPmerge 
(Additional file 15). Compared to our first investigation 
based on the three component maps [31], we estimated 
here the variability of the kernel density estimator, by sam- 
pling randomly 70% of the total number of markers for 
each chromosome independently, 999 times without re- 
placement [77,78]. For each random sample, we calculated 
a kernel density estimate. For all the kernel density esti- 
mates (from 999 random samples), we then calculated 
both the 2.5 and 97.5 percentiles, to define the confidence 
interval of the kernel density estimate. We defined the 
lower and upper bound thresholds of significance, by ana- 
lyzing the marker distribution, by comparing (in a chi^- 
test) the observed distribution of the number of markers 
per bandwidth with that expected under a Poisson distri- 
bution. A lower bound threshold, defining a cold spot of 
recombination (i.e. a cluster of markers on the linkage 
map) was determined when the observed number of 
markers was greater than the expected value, while the re- 
sults of the chi^-test were significant. Similarly, to define a 
hot spot of recombination, an upper bound threshold was 
determined when the observed number of markers was 
lower than expected, while the results of the chi^-test were 
significant. Finally, we compared the position of the confi- 
dence interval of the kernel density estimator with these 
lower and upper bounds, to identify significant hot and 
cold spots, respectively. 

Population structure analysis 

Genetic structure and cryptic relatedness within the 
FGB population were assessed in three ways. First, we 
assessed the patterns of pairwise relatedness, calculated 
from the genotype matrix as described in [79]. Second, 
we tested for cryptic population structure by performing 
principal component analysis (PCA) on the genotypic 
matrix of 2,600 markers, as described in [80], removing 
the dependence between SNPs at the same locus [81]. 
The leading eigenvalues obtained by PCA were tested 
for significance, by comparing their size with that expected 
under a Tracy- Widom distribution [80,82]. Genetic clus- 
ters were created on the basis of Ward clustering of the 
calculated Euclidean distance from the significant PCs [81]. 
Significant PCs were averaged per geographic location 



(sampling site) and their relationship to geographic loca- 
tion was investigated by linear regression on the principal 
components calculated for the geographic coordinates. 
Genetic isolation by distance was determined as the correl- 
ation between Euclidean distance along the averaged gen- 
etic PCs and geographic (degree) distance. Significance was 
assessed in a Mantel test. Finally, a third analysis of genetic 
structure was carried out with the software Structure 
v2.3.3 [38,83] using mapped loci. This method assumes 
Hardy- Weinberg equilibrium for the tested population and 
unlinked or weakly linked loci are required for clustering 
analysis. Before carrying out this analysis of genetic struc- 
ture, we checked that the markers used were in Hardy- 
Weinberg equilibrium. Then, for a given EST contig, we 
chose a single SNP at random, to avoid the problem of LD 
between loci. Based on these criteria, we used a genome- 
wide set of 1,180 mapped SNPs for the genetic structure 
analysis. We carried out three runs of Structure for each 
tested number of groups (/<), from 1 to 10. The correlated 
allele frequency model with admixture was used, with 
burn-in and run-length periods of 2.5x10^ iterations. We 
used the mean likelihood L(/<) and Evannos delta K criter- 
ion [84] values obtained over three runs to determine 
whether an optimum value of K could be identified, as ex- 
pected when discrete populations are present in the data. 

Spatial structure of diversity on chromosomes 

A SNP diversity map was superimposed on the composite 
linkage map. We used the FGB population to test depart- 
ure from Hardy- Weinberg equilibrium and to estimate 
three genetic diversity parameters for each SNP: minor al- 
lele frequency (MAF), observed heterozygosity (Ho) and 
expected heterozygosity (Hg, Neis index of genetic diver- 
sity [85]. Raw data (SNP genotypes for each individual) 
were formatted with GenAlEx6 [86] and analyses were 
conducted with the GenePop package [87,88] available on- 
line at URL: http://genepop.curtin.edu.au/. Genetic diver- 
sity parameters were finally retrieved from the output of 
GenePop, using a PerlScript. As these three parameters 
were highly correlated, we considered only Hg. 

We first analyzed the spatial structure of diversity 
along the LGs of the composite map by variance ana- 
lysis, generating a statistic that can be used to assess the 
covariance (i.e. correlation) between a variable of interest 
(here, Hg) and the location at which it is measured (here, 
the position of SNP markers on the composite linkage 
map). The covariance calculated is equal to half the vari- 
ance of the differences in the value of a metric (Z) be- 
tween all pairs of points (i and j) separated by a given 
distance (h). This approach is often referred to as semi- 
variance analysis in geostatistical studies (but see [89] 
for the confusion between the terms variance and semi- 
variance). If pairs of points are closely located spatially 
and correlated, then they will have a low variance. The 
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underlying assumption is that the difference in diversity 
between any two markers is a function of the distance 
between these markers. 
The empirical variance (f^) was calculated as follows: 



Yh 



l/2Var[Zs-Zs.] =:^T.i^^-^^J 

^^^^ Nh 



with Nh = {ij : s-Sj = h], 

where 5/ and Sj are the map positions of two SNP 
markers, Z^. and Z^. are the values of their diversity sta- 
tistics (He) and Npi is the number of paired data (SNP 
markers) at a distance of /z (1 to 10 cM) or less. We cal- 
culated variance with a robust estimator, to avoid the 
influence of outliers, as described in [90,91]. We first 
estimated the empirical variogram for each LG inde- 
pendently, and then by pooling all the data across LGs 
to estimate a pangenomic variogram. We determined 
whether a particular value of the variance differed sig- 
nificantly from a random value, by carrying out permu- 
tation tests in which the Hg values associated with each 
SNP marker were randomized with respect to chromo- 
somal position. One thousand permuted data sets were 
generated and the probability of finding a value higher 
than the observed value for a distance class was calcu- 
lated from the distribution of the permuted data. We 
then determined whether diversity was equally distrib- 
uted between LGs (presence of LGs with lower or higher 
overall diversity). A simple one-way ANOVA was per- 
formed, followed by a Tukeys HSD test for multiple 
comparisons of means. This test compares the difference 
between the Hg values of each pair of LGs, with appro- 
priate adjustment for multiple testing. 

Extent of linkage disequilibrium (LD) on chromosomes 

LD between pairs of loci was estimated by the squared al- 
lele frequency correlation r^ [92], based on SNP markers 
located on the composite map. We used the Rogers and 
Huff approximation for loci with unknown phases [93]. 
LD was calculated for all pairwise marker combinations, 
both within and between chromosomes. The range of 
minor allele frequencies in the FGB population was simi- 
lar across LGs, ranging from 0.15 to 0.5, and it was as- 
sumed that this population was unstructured, as shown in 
the results section. 

We investigated the distribution of intra-chromosomal 
LD over physical and genetic distances. For the estima- 
tion of short-distance (i.e. physical) LD, SNPs from the 
same contig (discarded for linkage map construction) 
were reintroduced into the LD analysis and placed at the 
same map position as the marker initially selected for 
linkage map generation. Pairwise r^ values were plotted 
against the genetic distance between the two loci (starting 



at 0 cM for SNPs from the same contig). We then built a 
null model to test for the presence of inter-chromosomal 
LD, by retaining only genetically linked pairs (i.e. corre- 
sponding to two different contigs) with critical values 
of r^ > 0.1 [94]. 

At the intragene level, LD was estimated by the 
squared allele frequency correlation r^, based on pairs of 
SNP belonging to the same contig, with MAF>5%. Of 
the 4,911 contigs studied, 248 contained two or more 
SNPs and were retained for the intragene LD analysis. 
The extent of LD was estimated by nonlinear regression 
analysis on the basis of intragene r^ values [95]. The ex- 
pected values of r^ between pairs of adjacent sites (E(r^)) 
were estimated with the formula: 



10 + C 



_(2 + C)(ll + C) 



(3 + C)(12 + 12C + C2 



f7(2 + C)(ll + C) 



which is valid under drift recombination equilibrium 
and low mutation rate and can be adjusted for sample 
size [96]. In this formula, C is the population recombin- 
ation parameter (p = 4Ner where Ne is the effective 
population size and r is the recombination rate per site 
and per generation) and n is the sample size. We carried 
out nonlinear regression (nls function) with R software 
X, replacing C with C x distance (in bp) between pairs of 
sites, to fit this formula to our data. 
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Additional file 1: Alignments of the composite linkage maps 
obtained with LPmerge (LPM on the left) and MergeMap (MM 
on the right) software. 

Additional file 2: Comparison between LPmerge and MergeMap for 
composite map construction. The first table provides the two metrics 
for statistical testir^g, i.e. lir^kage group length of the composite map and 
root mean square error (RMSE), whereas the second table provides the 
result of the test. Two intermediate composite maps (G2F_F2 and 
G2M_F2) were constructed before the production of the final composite 
map (G2F_F2-G2M_F2). 

Additional file 3: a) Pairwise kinship relationships between 192 
individuals of the FGB population, showing 3 pairs of trees with 
identical genotypic information over the 2,600 SNPs, which were 
therefore considered to be mislabeled in the tree archive, b) 
Pairwise kinship relationships between the 186 individuals of the 
FGB population, i.e. excluding the three abovementioned pairs. 

Additional file 4: List of SNP markers with dbSNP accession 
numbers, corresponding contig ID in PineContig_v2, genetic 
parameters in the first-generation breeding population, and linkage 
group assignment on the component maps. 

Additional file 5: a) Plot of genetic PCI and PC2 and their 
relationship to the two geographic components, b) biplot of PCA 
against geographic coordinates, c) relationship between the first 
genetic and geographic PC (averaged per location), d) relationship 
between the second genetic and geographic PC (averaged per 
location), e) genetic distance (along the first two genetic PCs) as a 
function of geographic distance. 
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Additional file 6: Clustering of the 186 GO trees of the FGB 
population using the Structure software. Distribution of Evanno's 
delta K values (A) and example of barplots obtained with numbers of 
groups K varying from 2 to 5 (A). 

Additional file 7: Distribution of genetic diversity (He values) along 
the 12 linkage groups of the maritime pine composite map. Blue: 
one SNP in the contig, He value for the SNP; red: two SNPs in the same 
contig, He value for the second SNP; Green: three SNPs in the same 
contig, He value for the third SNP; Purple: four SNPs in the same contig, 
He value for the fourth SNP. 

Additional file 8: Plot of linkage disequilibrium, measured as the 
squared correlation coefficient of allele frequencies (r\ against 
genetic map distance (cM) between all marker pairs in each of the 
12 linkage groups (LG) of the maritime pine genome, r^ was 

determined with the GGT 2.0 program, from the polymorphism data for 
186 unrelated trees of the Aquitaine population. The 0.1 critical level of r^ 
was determined after Bobbins et al. (201 1). J Exp Bot, 62:1831-1845. 

Additional file 9: Distribution of long distance intra-chromosomal 
linkage disequilibrium (LD) as estimated by r^. This distribution was 
used as a null model to test the significance of inter-chromosomal LD 
potentially due epistatic selection. 

Additional file 10: List of 65 pairs of markers with l\/IAF>20% and 
associated linkage disequilibrium values (r^). 

Additional file 11: Estimates of effective population sizes. 

Additional file 12: Description of the three component maps from 
Chancerel et al. (2013). 

Additional file 13: Geographic origin of the GO trees. 

Additional file 14: Genotyping dataset (186 trees of the FGB 
population x 2,600 SNPs). 

Additional file 15: Bandwidth values (cM) obtained by kernel 
density analysis for the composite linkage map obtained with 
LPmerge. 
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